
Moonshot AI · Chat / LLM · 1T Parameters (32B Active) · 256K Context

Streaming · Reasoning · Chain-of-Thought · Agentic Coding · Tool Orchestration · Long Context

Overview
Kimi K2 Thinking is the flagship open-weights reasoning model from Moonshot AI, a Chinese AI research company focused on building large-scale foundation models with advanced agentic capabilities. It is the first open-source model to outperform leading closed-source models, including GPT-5 and Claude 4.5 Sonnet, across major benchmarks: HLE (44.9%), BrowseComp (60.2%), and SWE-Bench Verified (71.3%). Built on a 1T-parameter sparse MoE architecture with 32B active parameters per token and native INT4 quantization via quantization-aware training (QAT), it runs at 2x the speed of FP8 deployments. The model maintains stable tool use across 200–300 sequential calls within a 256K context window, with interleaved chain-of-thought and dynamic tool calling for complex agentic workflows. Served instantly via the Qubrid AI Serverless API.

🚀 First open-source model to beat GPT-5 and Claude 4.5 Sonnet. 1T MoE. 2x FP8 speed. Deploy on Qubrid AI: no 512GB RAM cluster required.
Model Specifications
| Field | Details |
|---|---|
| Model ID | moonshotai/Kimi-K2-Thinking |
| Provider | Moonshot AI |
| Kind | Chat / LLM |
| Architecture | Sparse MoE Transformer: 1T total / 32B active per token, 61 layers (1 dense), 384 experts (8 selected per token), MLA attention, SwiGLU |
| Parameters | 1T total (32B active per forward pass) |
| Context Length | 256,000 Tokens |
| MoE | Yes |
| Release Date | November 2025 |
| License | Modified MIT License |
| Training Data | Large-scale diverse dataset with agentic reasoning trajectories; INT4 Quantization-Aware Training (QAT) in post-training |
| Function Calling | Not Supported |
| Image Support | N/A |
| Serverless API | Available |
| Fine-tuning | Coming Soon |
| On-demand | Coming Soon |
| State | 🟢 Ready |
Pricing
💳 Access via the Qubrid AI Serverless API with pay-per-token pricing. No infrastructure management required.
| Token Type | Price per 1M Tokens |
|---|---|
| Input Tokens | $0.60 |
| Input Tokens (Cached) | $0.30 |
| Output Tokens | $2.50 |
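As a quick sanity check on these rates, a short helper (a sketch using only the prices listed above) can estimate the cost of a session:

```python
def session_cost(input_tokens: int, cached_tokens: int, output_tokens: int) -> float:
    """Estimate session cost in USD from the per-1M-token rates above."""
    INPUT_RATE, CACHED_RATE, OUTPUT_RATE = 0.60, 0.30, 2.50  # $ per 1M tokens
    return (input_tokens * INPUT_RATE
            + cached_tokens * CACHED_RATE
            + output_tokens * OUTPUT_RATE) / 1_000_000

# A long agentic session: 200K fresh input, 800K cached input, 50K output
print(f"${session_cost(200_000, 800_000, 50_000):.3f}")  # $0.485
```

Cached input is billed at half the fresh-input rate, which adds up quickly when the same long context is resent on every agentic turn.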
Quickstart
Prerequisites
- Create a free account at platform.qubrid.com
- Generate your API key from the API Keys section
- Replace `QUBRID_API_KEY` in the code below with your actual key
⚠️ Temperature note: always use temperature=1.0 for Kimi K2 Thinking; this is the recommended setting for all tasks and for benchmark-consistent performance.
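As a minimal Python sketch of the quickstart request, the snippet below builds an OpenAI-compatible chat-completions payload for Kimi K2 Thinking. The base URL shown is a placeholder (confirm the actual endpoint in the Qubrid docs); only payload construction is demonstrated here.

```python
import json

BASE_URL = "https://platform.qubrid.com/v1"  # placeholder; confirm in the Qubrid docs
API_KEY = "QUBRID_API_KEY"  # replace with your actual key

def build_request(prompt: str) -> dict:
    """Build an OpenAI-compatible chat-completions payload for Kimi K2 Thinking."""
    return {
        "model": "moonshotai/Kimi-K2-Thinking",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,   # recommended setting for Kimi K2 Thinking
        "max_tokens": 16384,
        "top_p": 0.95,
        "stream": True,
    }

payload = build_request("What are the benefits of renewable energy?")
print(json.dumps(payload, indent=2))
```

To actually send the request, any OpenAI-compatible client works: point its base URL at the Qubrid endpoint and pass your API key as the bearer token.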
Live Example
Prompt: What are the benefits of renewable energy?
Response:
Playground Features
The Qubrid AI Playground lets you interact with Kimi K2 Thinking directly in your browser: no setup, no code, no cost to explore.

🧠 System Prompt
Define the model's reasoning depth, role, and tool-use constraints before the conversation begins; this is essential for long-horizon agentic research workflows and multi-step coding sessions. Set your system prompt once in the Qubrid Playground and it applies across every turn, including stable reasoning state across extended multi-step sessions.
🎯 Few-Shot Examples
Guide the model's reasoning style and output format with concrete examples: no fine-tuning, no retraining required.

| User Input | Assistant Response |
|---|---|
| Find all bugs in this Python function and fix them | Bug 1 (line 4): Off-by-one error; range(len(arr)) should be range(len(arr)-1). Bug 2 (line 7): Division by zero not handled; add: if denominator == 0: return None. Fixed function: [corrected code] |
| Prove that log₂(3) is irrational | Assume log₂(3) = p/q (rational, lowest terms). Then 2^(p/q) = 3 ⇒ 2^p = 3^q. Left side is even, right side is odd. Contradiction. Therefore log₂(3) is irrational. ∎ |
💡 Stack multiple few-shot examples in the Qubrid Playground to establish reasoning format and output structure; no fine-tuning required.
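In API terms, few-shot examples are simply prior user/assistant turns placed before the real query in the messages array. A sketch using the standard chat-message shape (the example text here is illustrative, abbreviated from the table above):

```python
# Few-shot turns precede the real query; the model infers the expected format.
few_shot_messages = [
    {"role": "system", "content": "You are a rigorous reviewer. Answer in the style shown."},
    {"role": "user", "content": "Find all bugs in this Python function and fix them"},
    {"role": "assistant", "content": "Bug 1 (line 4): ... Bug 2 (line 7): ... Fixed function: [corrected code]"},
    {"role": "user", "content": "Prove that log2(3) is irrational"},  # the real query
]

# Sanity check: roles should alternate user/assistant after the system turn.
roles = [m["role"] for m in few_shot_messages[1:]]
assert roles == ["user", "assistant", "user"]
```

Each stacked example adds one user/assistant pair before the final user turn; the Playground builds the same structure for you behind the scenes.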
Inference Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| Streaming | boolean | true | Enable streaming responses for real-time output |
| Temperature | number | 1 | Recommended temperature is 1.0 for Kimi K2 Thinking |
| Max Tokens | number | 16384 | Maximum number of tokens to generate |
| Top P | number | 0.95 | Controls nucleus sampling |
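With Streaming enabled, responses arrive incrementally as server-sent events. Assuming the standard OpenAI-style `data: {...}` chunk format (an assumption here, since the exact wire format depends on the Qubrid endpoint), a minimal parser sketch looks like this:

```python
import json

def iter_stream_text(lines):
    """Yield content deltas from OpenAI-style SSE chunk lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip keep-alives and blank lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0].get("delta", {}).get("content", "")
        if delta:
            yield delta

# Canned example stream (shape assumed from the OpenAI streaming format):
fake_stream = [
    'data: {"choices":[{"delta":{"content":"Hello"}}]}',
    'data: {"choices":[{"delta":{"content":", world"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_text(fake_stream)))  # Hello, world
```

For a thinking model, streaming matters more than usual: the reasoning phase adds latency, so surfacing tokens as they arrive keeps the interface responsive.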
Use Cases
- Complex agentic research workflows
- Long-horizon coding and debugging
- Advanced mathematical reasoning
- Multi-step tool orchestration
- Autonomous writing and analysis
- Scientific reasoning tasks
Strengths & Limitations
| Strengths | Limitations |
|---|---|
| First open-source model to beat GPT-5 and Claude 4.5 Sonnet on open benchmarks | Requires 512GB+ RAM for full self-hosted deployment |
| 1T MoE with only 32B active per token: frontier reasoning at high efficiency | ~600GB model size: large infrastructure needed for self-hosting |
| Native INT4 via QAT: 2x speed vs FP8 with no accuracy loss | Thinking mode means higher latency than non-reasoning models |
| Interleaved chain-of-thought with dynamic tool calling | Temperature must be set to 1.0 for recommended performance |
| Stable across 200–300 sequential tool calls | Function calling not supported via API |
| 256K context window for long-horizon agentic sessions | |
Why Qubrid AI?
- 🚀 No infrastructure setup: 1T MoE served serverlessly, pay only for what you use
- 🔄 OpenAI-compatible: drop-in replacement using the same SDK, just swap the base URL
- 💰 Cached input pricing: $0.30/1M for cached tokens, critical for long agentic sessions with repeated context
- 🧠 Frontier reasoning on demand: access the first open-source model to beat GPT-5 without managing a 600GB deployment
- 🧪 Built-in Playground: prototype with system prompts and few-shot examples instantly at platform.qubrid.com
- 📊 Full observability: API logs and usage tracking built into the Qubrid dashboard
Resources
| Resource | Link |
|---|---|
| 📘 Qubrid Docs | docs.platform.qubrid.com |
| 🎮 Playground | Try Kimi K2 Thinking live |
| 🔑 API Keys | Get your API Key |
| 🤗 Hugging Face | moonshotai/Kimi-K2-Thinking |
| 💬 Discord | Join the Qubrid Community |
Built with ❤️ by Qubrid AI
Frontier models. Serverless infrastructure. Zero friction.